BIO3092 Bioinformatics Workshop #3 Variant Calling
\(~\)
Overview
Today you’re going to compute variant sites from your genome alignments, as well as create a small Perl script to count the number of SNPs and reference sites are in your VCFs. To recap, you will be studying these 12 isolates belonging to fungal pathogen C. gattii:
\(~\)
Table 1
| Lineage | Name | Source | Date of isolation | Ref |
|---|---|---|---|---|
| VGI | MF18 | Soil (C. mopane) | 2013 | Farrer et al 2019 |
| VGI | WM179 | CSF, HIV-, Sydney Australia | 1993 | Meyer Med Mycol. 2009 |
| VGII | MF29 | Tree bark, Zambia 2013 | 2019 | Farrer et al 2019 |
| VGII | R265 | Clinical, Vancouver Island, CA | 2001 | Farrer et al 2015 |
| VGIII | MCP-1A | Environmental, LA, USA | 2011 | Byrnes EJ PLOS Path 2011 |
| VGIII | NIH312 | Clinical, CSF, California | 1986 | Byrnes EJ PLOS Path 2011 |
| VGIV | MF46 | Tree bark | 2013 | Farrer et al 2019 |
| VGIV | IND107 | Clinical, India unknown | 2015 | Farrer et al 2015 |
| VGV* | MF34 | Tree hole, Zambia | 2013 | Farrer et al 2019 |
| VGV* | MF51 | Tree hole, Zambia | 2013 | Farrer et al 2019 |
| VGVI* | WM1802 | Clinical, Mexico | 1990s | Farrer et al 2019 |
| VGVI* | WM1804 | Clinical, Mexico | 1990s | Farrer et al 2019 |
- Only environmental VGV have been found and only a single clinical VGVI has been found (cultured in 2 labs).
Preparing your output directories and alignment files
Once you have logged onto your instance, begin the workshop by going to the premade workshop 3 directory:
cd /home/ubuntu/resources/workshops/ws3-variant-calling/Next, you’ll need to make a folder for each of the isolates in ~/resources/Cgattii_DNAseq/. You can do this individually. E.g.
mkdir IND107
mkdir MCP-1Aetc.
Or you can do this in a single for loop on the command line:
cd ~/resources/Cgattii_DNAseq/
for i in *; do mkdir /home/ubuntu/resources/workshops/ws3-variant-calling/$i; doneFig.1 Directories ready for your output VCFs
Now that you have your folders set up, you need to locate your alignments that you generated in workshop 2. Specifically, you should have a sorted.bam file for all 12 C. gattii isolates. You should look for these files in:
/home/ubuntu/resources/workshops/ws2-alignments/And should have a name similar to IND107-bwa-mem.sorted.bam for each isolate.
Optional: Remake the sorted.bams
If you missed workshop 2, or cannot find a sorted.bam for each and every isolate, you can re-make them using the following commands (example given for isolate IND107):
# index the genome once
bwa index ~/resources/reference_sequence/Cryp_gatt_R265.genome.fa
# Go to directory of each isolate
cd /home/ubuntu/resources/workshops/ws3-variant-calling/IND107/
# Run alignment for each isolate
bwa mem ~/resources/reference_sequence/Cryp_gatt_R265.genome.fa ~/resources/Cgattii_DNAseq/IND107/IND107-subset_1.fq ~/resources/Cgattii_DNAseq/IND107/IND107-subset_2.fq > IND107-bwa-mem.sam
# Convert SAM to BAM
samtools view -S -b -u IND107-bwa-mem.sam > IND107-bwa-mem.bam
# Sort BAM
samtools sort IND107-bwa-mem.bam IND107-bwa-mem.sortedYou only have to do bwa index once on the reference genome. However, you will have to run bwa mem, samtools view, and samtools sort for every isolate, making sure you specify the correct input files. Genome alignments for each isolate (1 million paired reads each) should take around 5 minutes to compute.
Fig.2 BWA index output
Fig.3 BWA alignment output
Fig.4 BWA alignment files you should have
Variant Calling
Once you have all your sorted.bam files, the first step is to index the sorted.bam with samtools:
samtools index /home/ubuntu/resources/workshops/ws3-variant-calling/IND107/IND107-bwa-mem.sorted.bamThis will create a .bai file. Now, you’re ready to run variant calling on that sorted.bam. Run Pilon using the following commands (example again given for isolate IND107):
java -Xmx16G -jar ~/tools/pilon-1.23.jar --genome ~/resources/reference_sequence/Cryp_gatt_R265.genome.fa --frags IND107-bwa-mem.sorted.bam --output IND107-bwa-mem.sorted.bam.pilon --vcf --fix all --threads 4Make sure you supply all the correct input and a suitable output for each isolate. Pilon will identify a range of apparent issues with the reference genome. This is normal, and may reflect genuine issues that are yet to be resolved. Indeed, Pilon creates a polished assembly, which may be an improvement on the one supplied. Variant calling across a eukaryotic genome is computationally complex, and Pilon will take roughly 20 minutes per isolate. The output when complete will include a new fasta file and the VCF.
Fig.5 Pilon running
Look at your VCF with less E.g.
less IND107-bwa-mem.sorted.bam.pilon.vcfpress space bar to scroll through pages, and q to exit. You should be able to understand what each column means in the VCF, and manually look for any sites that represent either a reference base (GT=0) or SNP (GT=1). You may notice some flagged with LowCov (low coverage) as well as ambiguous.
Next, lets remove all of these low coverage sites, as well as those marked as ambiguous or a deletion using E.g.
grep -vwE "(LowCov|Amb|Del)" IND107-bwa-mem.sorted.bam.pilon.vcf > IND107-bwa-mem.sorted.bam.pilon-filtered.vcfGrep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Further details about its use and parameters can be found here:
https://man7.org/linux/man-pages/man1/grep.1.html
Importantly, -v invert the sense of matching, to select non-matching lines. We also use the whole-word option (-w) and -E option to enable the extended regular expressions for the (one|other) syntax. If we run the same command without the -v, we will pull only those low coverage, ambiguous or deletion sites.
These are your final VCF’s that will be suitable for exploring phylogenetic relationships, evolution and selection, as well as various population genetic tests. Make sure you have made final VCF’s for all 12 isolates, and given them sensible names before moving onto the next step.
Counting the variants in the VCF
The last step will be to summarise the variants you have called. There are many ways to do this, but we’ll opt for some Perl coding, using the command line editor vim. If you are familiar or prefer a different command line editor such as emacs or nano then feel free to use that.
First let’s make a new file called VCF_summary.pl (consider placing this in the ws3-variant-calling directory) using:
vim VCF_summary.plnext we’ll enter ‘input mode’ by typing the letter ‘i’. From input mode you can write text in your file. The exit input mode you press the escape button. To quit without saving you type :q! (colon q exclamation mark). To save and quit, you type :wq (colon w q). vim is a powerful text editor, and there are many more commands that you can find instructions for here: https://vim.rtorr.com/
In input mode, we’ll type the following code:
#!/usr/bin/perl -w
use strict;
my $usage = "Usage: perl $0 <VCF>";
die $usage unless(@ARGV eq 1);
warn "Running $0 with the input file $ARGV[0]...\n";Now save and quit. To run your script type:
Perl VCF_summary.plYou should see a warning message of your Usage string. The script is asking you to provide a VCF as a parameter on the command line. If you have errors, then look very carefully for any differences between the code above and your code (e.g. semi colons or misspelled words etc.). A line-by-line description of the code is:
- The hash bang line. Says where the Perl interpreter is. We’re also providing the flag -w which is requesting warnings when found
- Strict is a library that ensures if any unexpected syntax or commands are used it throws an error rather than runs
- Define a new variable (new variables are denoted ‘my’, and the names start with a $ for scalar variable (includes strings and numbers). You’re defining the variable usage as a string, which itself includes a variable $0, which is a special Perl variable of the name of the script.
- Stop (die) the script with the usage variable if the number of parameters provided on the command line is not equal to 1. Parameters provided on the command line are saved by default in the ARGV array. Arrays have @ at the start of their name.
- Warn (a print to standard error instead of standard out) a string explaining what the script is doing.
Once that is working, you’re ready to try running the script with one of your VCF’s.
perl VCF_summary.pl IND107.sorted.bam.pilon-filtered.vcfYou should now see the warning message
Running VCF_summary.pl with the input file IND107.sorted.bam.pilon-filtered.vcfNext we’ll add the rest of the code to open the VCF and count SNPs and reference bases. Open the script again in a text editor, and on a new line below the rest of the code from above, type or copy:
my $ref_counts = 0;
my $snp_counts = 0;
open my $fh, ‘<’, $ARGV[0] or die “Cannot open $ARGV[0] : $!”;
while(my $line=<$fh>) {
next if($line =~ m/^#/);
chomp $line;
my @bits = split /\t/, $line;
my ($contig, $position, $id, $ref, $consensus, $qual, $filter, $info, $format) = @bits;
my @format_bits = split /:/, $format;
my $genotype = $format_bits[0];
if($genotype eq 0) {
$ref_counts++;
} else {
$snp_counts++;
}
}
close $fh;
warn “Number of reference bases = $ref_counts\n”;
warn “Number of SNPs = $snp_counts\n”;A line by line explanation of the code above:
Line 1-2: Initialise new scalar variables with the value 0
Line 3: Open the file specified on the command line. $fh stores the filehandle – which is a special variable for files, although it can still be stored in a scalar variable.
Line 4: The while loop will by default go through the file line by line, saving the line to the variable $line. Everything within the curly brackets { } will be run once per line – creating new variables along the way.
Line 5: skip lines that are VCF header lines specified with a hash symbol (#) at the start
Line 6: remove new line/enter character at the end of lines.
Line 7: split lines on tab characters that are used to separate the columns. The ‘bits’ of the line are split into an array called @bits, where $bits[0] holds the first element, which is the contig.
Line 8: save all the elements in @bits into more human readable names such as $contig and $position.
Line 9: Split the format field on colon (:) which separates the various formats specified in the information column
Line 10: Save the first element in the format field into a variable called genotype.
Line 11-15: If the genotype is equal to 0, then increment (+1) the value stored in the variable ref_counts, or incement the value stored in snp_counts.
Line 17: Close the file handle
Line 18-19: Output to standard error (warn) the values after going through each line in the VCF.
Fig.6 Perl script counting the variants in a VCF
Now run this script for each of your VCFs. Make a table of the counts you have for each isolate.
- Which isolate has the fewest SNPs?
- Which isolate has the most SNPs?
- Do all the isolates have as many positions (ref + SNP) specified in the VCF?
- What are some draw backs of this script / how could it be improved? E.g. do we ignore any other category of bases that are present in the VCF?
- What do the number of SNPs tell you about the relationship of the isolates to the reference sequence (VGII R265)? Which isolates do you think also belong to the VGII lineage based on SNP counts alone.
In the phylogenetics tutorial, you can test your hypothesis and construct a tree to see how these variants are used to reconstruct the isolates evolutionary relationships.
Finding variants in a drug resistance gene
ERG11 is a gene encoded by C. gattii and many other fungi. ERG11 catalyzes C14-demethylation of lanosterol which is critical for ergosterol biosynthesis. It transforms lanosterol into 4,4'-dimethyl cholesta-8,14,24-triene-3-beta-ol. Ergosterol is an essential component of fungal membranes (analogous to our cholesterol) and the target for one of the three main classes of antifungals used in the clinic: the azoles, principally fluconazole. The widespread use of azoles has led to increasing azole resistance among various species of fungi. One mechanism of azole resistance involves point mutations (SNPs) in ERG11. In our reference genome Cryp_gatt_R265, ERG11 is called CNB3_002300. To identify the coordinates of CNB3_002300, we’ll look in the general feature file saved in ~/resources/reference_sequence/Cryp_gatt_R265.annotation.gff3.
To pull out all entries for CNB3_002300, we can again use the grep command:
grep CNB3_002300 ~/resources/reference_sequence/Cryp_gatt_R265.annotation.gff3Fig.7 PTerminal showing GFF entries for ERG11 in C. gattii
The GFF tells us that ERG11 is a gene with 9 exons (there are 9 exon features) found on scaffold3.4 between positions 93928 and 96038 (the gene entry covers all exons).
To see if we have any variants between these positions, we’ll need a new script, which is very similar to the last. So, lets first copy the old script to a new file
cp VCF_summary.pl VCF_look_for_ERG11_mutations_in_Cgattii.plNow let’s add the following lines before the if statement on line 18 (if($genotype etc)
next if($contig ne 'scaffold3.4');
next unless($position >= 93928);
next unless($position <= 96038);
next if($genotype eq '0/0');
print "$line\n";This will move to the next line if 1) the contig is not scaffold3.4, 2) the position >= 93,928, 3) the position <= 96,038 and 4) the base is a reference base. Any lines that satisfy all those conditions will be variants in the ERG11 gene.
Now let’s run this script:
perl VCF_look_for_ERG11_mutations_in_Cgattii.pl IND107-bwa-mem.sorted.bam.pilon-filtered.vcf- How many SNPs in ERG11 do you find for each isolate?
- How could you explore further if these mutations are important to drug resistance?